Shrinkage estimators for prediction out-of-sample: 
Abstract

We find that, in a linear model, the James-Stein estimator, which dominates the maximum-likelihood
estimator in terms of its in-sample prediction error, can perform poorly compared to the maximum-likelihood 
estimator in out-of-sample prediction. We give a detailed analysis of this phenomenon and discuss its 
implications. When evaluating the predictive performance of estimators, we treat the regressor matrix in 
the training data as fixed, i.e., we condition on the design variables. Our findings contrast with those obtained
by Baranchik (1973, Ann. Stat. 1:312-321) and, more recently, by Dicker (2012, arXiv:1102.2952) in an
unconditional performance evaluation.
1 Introduction

The problem of in-sample prediction, i.e., estimating the regression function at the observed design points, 
is arguably among the most extensively studied topics in regression analysis. But methods designed to 
perform well for in-sample prediction need not perform well for out-of-sample prediction, i.e., for estimating 
the regression function at a new point. We study the out-of-sample predictive performance of the James-Stein
estimator, which dominates the maximum-likelihood estimator in the in-sample scenario; see Stein (1956) or
the comprehensive monograph of Judge and Bock (1978). We focus on the James-Stein estimator because of
its conceptual importance (and because it is amenable to a detailed analysis). The James-Stein
estimator is the first method that was found to dominate maximum-likelihood through shrinkage, a discovery
that helped to spark the development of many of the powerful estimation methods available today that rely
on some sort of shrinkage through, e.g., regularization, model selection, or model averaging; see Leeb and
Pötscher (2008) for a survey. In this paper, we find that the James-Stein estimator can perform poorly
compared to the maximum-likelihood estimator in out-of-sample prediction, and we analyze and explain this
phenomenon.

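Consider the Gaussian linear regression model

$$ Y = X\beta + u, \tag{1} $$

where $X$ is a fixed $n \times p$ matrix of rank $p$, $\beta \in \mathbb{R}^p$, $n > p \geq 3$, and
$u \sim N(0, \sigma^2 I_n)$. For simplicity, we focus on the known-variance case and we assume that
$\sigma^2 = 1$. Given an estimator $\hat\beta$ for $\beta$, the corresponding in-sample prediction error,
i.e., the mean squared error when estimating $X\beta$ by $X\hat\beta$, will be denoted by
$\rho_1(\hat\beta, \beta, X)$ and is defined by

$$ \rho_1(\hat\beta, \beta, X) = E\left[ \frac{1}{n} (X\hat\beta - X\beta)'(X\hat\beta - X\beta) \right] = E\left[ (\hat\beta - \beta)' \frac{X'X}{n} (\hat\beta - \beta) \right]. \tag{2} $$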



For out-of-sample prediction, consider a new set of explanatory variables, i.e., a $p$-vector $x_0$, that is
independent of $Y$, and hence also independent of $\hat\beta$, and that satisfies $E[x_0] = 0$ and
$E[x_0 x_0'] = \Sigma$, where $\Sigma$ is positive definite and is regarded as a nuisance parameter. (In the
case of a fixed $x_0 \in \mathbb{R}^p$, we end up with the one-dimensional estimation target $x_0'\beta$, and
it is well known that the maximum-likelihood estimator for $x_0'\beta$ is the unique admissible minimax
estimator; see Lehmann and Casella (1998).) The out-of-sample prediction error is the mean squared error when
$x_0'\hat\beta$ is used to predict $x_0'\beta$, where now the mean is taken with respect to both $Y$ and
$x_0$. This error will be denoted by $\rho_2(\hat\beta, \beta, X)$ and is defined by

$$ \rho_2(\hat\beta, \beta, X) = E\left[ (x_0'\hat\beta - x_0'\beta)^2 \right] = E\left[ (\hat\beta - \beta)' \Sigma (\hat\beta - \beta) \right]. \tag{3} $$



(Of course, the out-of-sample prediction error $\rho_2(\hat\beta, \beta, X)$ also depends on the matrix
$\Sigma$, although this dependence is not explicitly shown in our notation.) We note that the existing results
of Baranchik (1973) and Dicker (2012) consider prediction errors by assuming that $X$ is random and by taking
expectations as in (2) and (3) also with respect to $X$. We, on the other hand, compute prediction errors by
treating $X$ as fixed, i.e., we condition on the design. The expressions on the far right-hand sides of (2)
and (3) differ in the matrices $X'X/n$ and $\Sigma$. If $X'X/n$ is very close to $\Sigma$, then the in-sample
prediction error will be close to the out-of-sample prediction error. But if $X'X/n$ is not very close to
$\Sigma$, then the in-sample predictive performance as measured by $\rho_1(\hat\beta, \beta, X)$ can be
markedly different from the out-of-sample predictive performance as measured by $\rho_2(\hat\beta, \beta, X)$.

In this paper, we compare the maximum-likelihood estimator, the James-Stein estimator, and related
shrinkage-type estimators by their performance as out-of-sample predictors. For in-sample prediction, i.e.,
in terms of the risk $\rho_1(\cdot, \cdot, X)$, it is well known that the James-Stein estimator dominates the
maximum-likelihood estimator. But for out-of-sample prediction, i.e., in terms of $\rho_2(\cdot, \cdot, X)$,
we find that the James-Stein estimator can perform quite poorly compared to the maximum-likelihood estimator;
see Figure 1, relation (6), and also relation (8) in Theorem 4. (This finding contrasts with a result of
Baranchik (1973), as discussed at the end of Section 2 and after Theorem 3 in Section 3.) But we also find
that such disappointing worst-case performance of the James-Stein estimator is atypical, in a certain sense;
see relation (9) in Theorem 4.

For the case where $\Sigma$ in (3) is known, estimators that dominate the maximum-likelihood estimator in
terms of the risk (3) are well known, and we refer to Strawderman (2003) and the references given therein.
... et al. (1977), and Copas (1983). For the challenging case where $\Sigma$ is unknown and not estimable
(in the sense that no further structural restrictions are imposed on $\Sigma$ and that $p/n$ is large, a
scenario that we study in Section 3), we are not aware of further relevant existing results.

Explicit finite-sample formulae for the out-of-sample prediction errors of the estimators in question are
derived in Section 2. In Section 3, we present approximations to the finite-sample quantities of interest; our
approximations become accurate as $n \to \infty$, uniformly in the underlying parameters. Conclusions are drawn
in Section 4, and the more technical derivations are collected in the appendices.



2 Explicit finite-sample results 

Recall that the maximum-likelihood estimator of $\beta$ is
$\hat\beta_{ML} = (X'X)^{-1}X'Y \sim N(\beta, (X'X)^{-1})$. As pointed out by Stein (1956), James-Stein-type
shrinkage estimators here correspond to estimators $\hat\beta(c)$ of $\beta$ with

$$ \hat\beta(c) = \left[ 1 - \frac{cp}{\hat\beta_{ML}' X'X \hat\beta_{ML}} \right] \hat\beta_{ML}, $$

where $c \geq 0$ is a tuning-parameter (cf. Appendix A). In particular, the traditional James-Stein estimator
corresponds to the estimator $\hat\beta(c)$ with $c = (p-2)/p$ and will be denoted by $\hat\beta_{JS}$; in the
following, $\hat\beta_{JS}$ will also be called the James-Stein estimator (of $\beta$).
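The definition of the James-Stein-type estimator can be sketched in code as follows. This is a minimal illustration, assuming NumPy is available; the function names `shrinkage_estimator` and `james_stein` are hypothetical and not taken from the paper. The quadratic form $\hat\beta_{ML}'X'X\hat\beta_{ML}$ is computed as the squared norm of the fitted values $X\hat\beta_{ML}$.

```python
import numpy as np

def shrinkage_estimator(X, Y, c):
    """Return (beta_hat(c), beta_hat_ML) for the linear model Y = X beta + u.

    beta_hat(c) = [1 - c*p / (beta_ML' X'X beta_ML)] * beta_ML, with c >= 0.
    """
    n, p = X.shape
    # Maximum-likelihood (least-squares) estimator (X'X)^{-1} X'Y.
    beta_ml = np.linalg.solve(X.T @ X, X.T @ Y)
    fitted = X @ beta_ml
    q = float(fitted @ fitted)  # equals beta_ML' X'X beta_ML
    return (1.0 - c * p / q) * beta_ml, beta_ml

def james_stein(X, Y):
    """Traditional James-Stein estimator, i.e., tuning parameter c = (p - 2) / p."""
    p = X.shape[1]
    return shrinkage_estimator(X, Y, (p - 2) / p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 5))
    Y = X @ np.ones(5) + rng.standard_normal(50)
    beta_js, beta_ml = james_stein(X, Y)
    print("ML:", beta_ml)
    print("JS:", beta_js)
```

Setting `c = 0` returns the maximum-likelihood estimator itself, which matches the convention $\hat\beta_{ML} = \hat\beta(0)$ used later in the text.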

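Proposition 1. The in-sample prediction error and the out-of-sample prediction error of the
James-Stein-type shrinkage estimator $\hat\beta(c)$ satisfy

$$ \rho_1(\hat\beta(c), \beta, X) = \rho_1(\hat\beta_{ML}, \beta, X) - \frac{1}{n} \left[ 2cp(p-2) - c^2 p^2 \right] E\left[ \frac{1}{\hat\beta_{ML}' X'X \hat\beta_{ML}} \right] \tag{4} $$

with $\rho_1(\hat\beta_{ML}, \beta, X) = p/n$, and

$$ \rho_2(\hat\beta(c), \beta, X) = \rho_2(\hat\beta_{ML}, \beta, X) - 2cp \, \mathrm{trace}(\Sigma (X'X)^{-1}) \, E\left[ \frac{1}{\hat\beta_{ML}' X'X \hat\beta_{ML}} \right] + (c^2 p^2 + 4cp) \, E\left[ \frac{\hat\beta_{ML}' \Sigma \hat\beta_{ML}}{(\hat\beta_{ML}' X'X \hat\beta_{ML})^2} \right] \tag{5} $$

with $\rho_2(\hat\beta_{ML}, \beta, X) = \mathrm{trace}(\Sigma (X'X)^{-1})$, respectively.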



The well-known formula on the right-hand side of (4) shows that the in-sample prediction error of
$\hat\beta(c)$ equals the in-sample prediction error of $\hat\beta_{ML}$ minus the product of a positive
expected value and a polynomial in $c$ and $p$. In particular, $\rho_1(\hat\beta(c), \beta, X)$ is smaller
than $\rho_1(\hat\beta_{ML}, \beta, X)$ if $c$ satisfies $0 < c < 2(p-2)/p$ and is minimized for
$c = (p-2)/p$, which is the tuning-parameter used by $\hat\beta_{JS}$. Moreover, it is easy to see that
$\rho_1(\hat\beta(c), \beta, X)$ depends on $\beta$ and $X$ only through $\beta'(X'X/n)\beta$. Unlike the
formula for $\rho_1(\hat\beta(c), \beta, X)$ in (4), display (5) shows that $\rho_2(\hat\beta(c), \beta, X)$
is obtained from $\rho_2(\hat\beta_{ML}, \beta, X)$ by subtracting a positive term and then adding another
positive term, i.e., the second and the third term on the right-hand side of (5), which depend on $c$, on $p$,
and on the unknown parameters in a more complicated fashion. By further inspection, we find that (5) depends
on $\beta$ and $X$ through $\beta'(X'X/n)\beta$ and through $\beta'\Sigma\beta$ (which can be viewed as a
kind of signal-to-noise ratio); for details, see Proposition A.1.



o 

CO 



in 
c\i 



o 




signal-to-noise-ratio 



Figure 1: The solid curves in black 
and gray show finite sample rel- 
ative prediction errors as a func- 
tion of /3'£/3, and are explained in 
the next paragraph. The dashed 
curve shows the approximation to 
the relative prediction error that is 
obtained from Theorem [2] in Sec- 
tion [3] The constant solid line at 
1 is for reference. 



Figure 1 exemplifies the relative out-of-sample prediction error of the James-Stein estimator and of the
maximum-likelihood estimator, i.e., $\rho_2(\hat\beta_{JS}, \beta, X) / \rho_2(\hat\beta_{ML}, \beta, X)$, as a
function of $\beta'\Sigma\beta$ for various configurations in parameter space. For the figure, we selected a
scenario where $X'X/n$ is not very close to $\Sigma$, so that the in-sample and the out-of-sample predictive
performance of estimators differ from each other (as noted in the discussion after (3)). In particular, we
took $n = 200$, $p = 160$, $\Sigma = I_p$, and $X$ was obtained by sampling i.i.d. standard normals. The solid
curves show $\rho_2(\hat\beta_{JS}, \beta, X) / \rho_2(\hat\beta_{ML}, \beta, X)$ as a function of
$\beta'\Sigma\beta$, for $\beta$ parallel to various eigenvectors of $X'X/n$. Let $w_i$ be the eigenvector
corresponding to the eigenvalue $\nu_i$ of $X'X/n$, and assume that $\nu_1 < \cdots < \nu_{160}$. Four solid
black curves stay below 1 and appear to be ordered; starting from the top, these correspond to $\beta$ parallel
to $w_{160}$, $w_{120}$, $w_{80}$, and $w_{40}$. The fifth solid black curve, which exceeds 1, corresponds to
$\beta$ parallel to $w_1$, i.e., the eigenvector of the smallest eigenvalue. This curve attains a maximum of
4.27 at 75.12 (which is off the chart) and then recedes back towards 1 as $\beta'\Sigma\beta \to \infty$. The
gray curves are obtained in the same way but for eigenvectors corresponding to the remaining smallest 25% of
eigenvalues. The curves in Figure 1 differ dramatically depending on whether they correspond to small
eigenvalues like $\nu_1$ on the one hand, or to moderate-to-large eigenvalues like $\nu_{40}$, $\nu_{80}$,
$\nu_{120}$, and $\nu_{160}$ on the other. Repeating these computations with $X$ replaced by a new independent
sample, we obtained essentially the same results. And for other choices of $p$ and $n$, we obtained results
that are qualitatively similar, including maxima above 1 corresponding to small eigenvalues. This phenomenon
becomes less pronounced as $p/n$ decreases, and it disappears completely for very small values of $p/n$. The
results in Section 3 entail that this is no surprise.

Figure 1 shows that the James-Stein estimator no longer dominates the maximum-likelihood estimator
for out-of-sample prediction, in the sense that

$$ \rho_2(\hat\beta_{JS}, \beta, X) > \rho_2(\hat\beta_{ML}, \beta, X) \tag{6} $$

for some $\beta \in \mathbb{R}^p$, if $X$ is the design matrix used to generate the figure. (Indeed, the
left-hand side of the preceding display exceeds the right-hand side by a factor of 4.27 for appropriately
chosen $\beta$, as noted in the preceding paragraph.) This should be compared to the findings of
Baranchik (1973): The results in that paper suggest that

$$ E\left[ \rho_2(\hat\beta_{JS}, \beta, X) \right] \leq E\left[ \rho_2(\hat\beta_{ML}, \beta, X) \right] \tag{7} $$

for each $\beta \in \mathbb{R}^p$, with strict inequality for some $\beta$, if $X$ is random with i.i.d.
$N(0, \Sigma)$-distributed rows, and where the expectation in (7) is taken with respect to $X$ (see also
Dicker (2012)). Comparing the preceding two displays, we see that for $X$ fixed, cf. (6), $\hat\beta_{JS}$ can
perform poorly for some $\beta \in \mathbb{R}^p$. But on average with respect to $X$, as considered in
Baranchik (1973) and Dicker (2012), the relation in (7) suggests that $\hat\beta_{JS}$ performs well,
irrespective of $\beta$. The performance of $\hat\beta_{JS}$ hence depends crucially on whether we condition
on the design as in (6) or average with respect to the design distribution as in (7). We give a more detailed
analysis and explanation of this phenomenon in the next section. Also note that the phenomenon in (6) and
(7) is related to the ancillarity paradox of Brown (1990).



3 Asymptotic approximations 

In this section, we provide approximations to quantities like $\rho_2(\hat\beta_{JS}, \beta, X)$ and
$\sup_\beta \rho_2(\hat\beta_{JS}, \beta, X)$ for 'typical' design matrices $X$. Here, 'typical' means 'in
probability' when the explanatory variables in the training period, i.e., the rows of $X$, are taken as
realizations from the same distribution as those in the prediction period (i.e., $x_0$). Our approximations
are uniform in the unknown parameters and become accurate as $n \to \infty$, where the dimension of the model
considered at sample size $n$, i.e., $p$, is allowed to depend on sample size. Note that quantities like
$\rho_2(\hat\beta(c), \beta, X)$ and $\rho_2(\hat\beta_{ML}, \beta, X)$ now become random variables through
their dependence on $X$. We emphasize that our evaluation of performance is always taken conditional on $X$,
and that the random design is used only to describe the behavior for 'typical' design matrices $X$. We also
stress, with $X$ random, that the expectations appearing in the prediction errors above are now to be
understood as conditional on $X$.

For each $n$ and $p$ under consideration ($n > p \geq 3$), we assume that the model (1) holds; that $x_0$ and
$X$ are independent of the error $u$ in (1); and that the rows of $X$ and also $x_0$ are i.i.d. with mean zero
and positive definite covariance matrix $\Sigma$. In addition, we assume that $X$ can be written as
$X = V\Sigma^{1/2}$, where $\Sigma^{1/2}$ denotes a symmetric square root of $\Sigma$ and where $V$ is the
$n \times p$ matrix obtained by taking the upper left block of a double infinite array
$(V_{i,j})_{i \geq 1, j \geq 1}$ of i.i.d. random variables that have mean zero, variance one, and a finite
fourth moment (cf. Bai and Silverstein (2010)). Finally, we also assume that the (marginal) distribution of
the $V_{i,j}$'s is absolutely continuous with respect to Lebesgue measure. This assumption could be dropped
altogether, but at the expense of longer and technically more involved proofs. Under these assumptions, we
note that $X'X$ is invertible almost surely. We set $\rho_2(\hat\beta(c), \beta, X) = 0$ on the
probability-zero event where $X'X$ is degenerate. Otherwise, the random variable
$\rho_2(\hat\beta(c), \beta, X)$ is defined by the expression on the right-hand side of (5), where the expected
values are to be understood as conditional on $X$. These conventions also cover
$\rho_2(\hat\beta_{ML}, \beta, X)$ and $\rho_2(\hat\beta_{JS}, \beta, X)$ in view of
$\hat\beta_{ML} = \hat\beta(0)$ and $\hat\beta_{JS} = \hat\beta((p-2)/p)$. The following two results provide
simple asymptotic approximations to quantities like $\rho_2(\hat\beta_{JS}, \beta, X)$ as well as
$\sup_{\beta \in \mathbb{R}^p} \rho_2(\hat\beta_{JS}, \beta, X)$.

Theorem 2. Assume that $n \to \infty$, and that $p = p(n)$ is such that $p/n \to t \in [0, 1)$. Moreover, for
each $p$, let $\beta$ and $\Sigma$ be a $p$-vector and a positive definite $p \times p$ matrix, respectively,
so that $\beta'\Sigma\beta \to \delta^2 \in [0, \infty]$ as $n \to \infty$. Then
$\rho_2(\hat\beta_{ML}, \beta, X) \to t/(1-t)$ and $\rho_2(\hat\beta(c), \beta, X) \to r(\delta^2, c, t)$ in
probability, where

$$ r(\delta^2, c, t) = \frac{t}{1-t} \left( 1 - c \, \frac{t}{t + \delta^2} \right)^2 + c^2 \, \frac{t^2 \delta^2}{(t + \delta^2)^2} $$

if $\delta^2$ is finite, and where $r(\infty, c, t) = t/(1-t)$ otherwise (expressions like $t/(t + \delta^2)$
are to be interpreted as zero if $t$ and $\delta^2$ are both equal to zero). Convergence of
$\rho_2(\hat\beta(c), \beta, X)$ is uniform in $c$ over compact sets, in the sense that
$\sup_{0 \leq c \leq C} |\rho_2(\hat\beta(c), \beta, X) - r(\delta^2, c, t)| = o_P(1)$ for any $C > 0$.
Moreover, these statements continue to hold if $\rho_2(\hat\beta(c), \beta, X)$ is replaced by the estimator
$r(\hat\delta^2, c, p/n)$, where $\hat\delta^2 = \max\{Y'Y/n - 1, 0\}$.
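The limit in Theorem 2 is easy to evaluate numerically. The following sketch assumes the limit has the form $r(\delta^2, c, t) = \frac{t}{1-t}(1 - c\,s)^2 + c^2 s^2 \delta^2$ with $s = t/(t+\delta^2)$, which is an algebraically equivalent way of writing the display in the theorem; the function name `r_limit` is hypothetical.

```python
import math

def r_limit(delta2, c, t):
    """Limit r(delta^2, c, t) of the out-of-sample error of beta_hat(c), for 0 <= t < 1."""
    if not 0.0 <= t < 1.0:
        raise ValueError("requires 0 <= t < 1")
    if math.isinf(delta2):
        return t / (1.0 - t)
    # Convention from the theorem: t / (t + delta^2) is zero if t = delta^2 = 0.
    s = 0.0 if (t == 0.0 and delta2 == 0.0) else t / (t + delta2)
    return t / (1.0 - t) * (1.0 - c * s) ** 2 + c * c * s * s * delta2

if __name__ == "__main__":
    t = 0.8  # the ratio p/n = 160/200 used for Figure 1
    for delta2 in (0.0, 0.5, 1.0, 5.0, 50.0):
        ratio = r_limit(delta2, 1.0, t) / (t / (1.0 - t))
        print(f"delta^2 = {delta2:6.1f}: r / (t/(1-t)) = {ratio:.4f}")
```

The printed ratios correspond to the kind of relative prediction error that the dashed curve in Figure 1 approximates (with $c = 1$ standing in for $c = (p-2)/p$, which is an assumption of this sketch).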

Theorem 3. Assume that $n \to \infty$, and that $p = p(n)$ is such that $p/n \to t \in [0, 1)$. Moreover, for
each $p$, let $\Sigma$ be a positive definite $p \times p$ matrix. Then
$\sup_{\beta \in \mathbb{R}^p} \rho_2(\hat\beta(c), \beta, X) \to \sup_{\delta^2 \geq 0} R(\delta^2, c, t)$ in
probability, where

$$ R(\delta^2, c, t) = \frac{t}{1-t} \left( 1 - c \, \frac{t}{t + \delta^2} \right)^2 + c^2 \, \frac{t^2 \delta^2}{(1 - \sqrt{t})^2 (t + \delta^2)^2} $$

(again, expressions like $t/(t + \delta^2)$ are to be interpreted as zero if $t$ and $\delta^2$ are both equal
to zero). Convergence is uniform in $c$ over compact sets, so that
$\sup_{0 \leq c \leq C} | \sup_{\beta \in \mathbb{R}^p} \rho_2(\hat\beta(c), \beta, X) - \sup_{\delta^2 \geq 0} R(\delta^2, c, t) | = o_P(1)$
for any $C > 0$.

Remark. The quantities $r(\delta^2, c, t)$ and $R(\delta^2, c, t)$, as defined in Theorems 2 and 3,
respectively, differ by a factor of $(1 - \sqrt{t})^2$ in the denominator of the last term. The proofs reveal
that this factor is caused by the fact that $\beta'(X'X/n)\beta$ can differ considerably from its expectation
if $p/n$ is not small, because of the gap between the smallest eigenvalue of a large-dimensional random
Wishart matrix (or Gram matrix) and the smallest eigenvalue of its expectation, a well-known phenomenon in the
theory of random matrices; see, for example, Bai and Silverstein (2010). In particular, in the setting of
Theorem 2 with $\delta^2 = 1$, $\beta'(X'X/n)\beta$ converges to one in probability, but
$\min_{b: \, b'\Sigma b = 1} b'(X'X/n)b$ converges to $(1 - \sqrt{t})^2$.

The approximations to $\rho_2(\hat\beta(c), \beta, X)$ and $\rho_2(\hat\beta_{ML}, \beta, X)$ provided by
Theorem 2, i.e., $r(\delta^2, c, t)$ and $t/(1-t)$, respectively, are such that
$r(\delta^2, c, t) < t/(1-t)$ whenever $t > 0$, provided only that $0 < c < 2$ and $\delta^2 < \infty$; cf.
the dashed line in Figure 1. The results of Dicker (2012) suggest approximations to
$E[\rho_2(\hat\beta(c), \beta, X)]$ and $E[\rho_2(\hat\beta_{ML}, \beta, X)]$ that coincide with
$r(\delta^2, c, t)$ and $t/(1-t)$, respectively.

Theorems 2 and 3, together with the attending remark, also provide us with a more precise description of
the phenomenon in (6) and (7), and with a better understanding of the underlying cause. More formally, we
have the following result.

Theorem 4. Assume that $n \to \infty$, and that $p = p(n)$ is such that $p/n \to t \in [0, 1)$. Moreover, for
each $p$, let $\Sigma$ be a positive definite $p \times p$ matrix. If $t > 1/9$, then the out-of-sample
prediction errors of the James-Stein estimator and of the maximum-likelihood estimator are such that

$$ P\left( \sup_{\beta \in \mathbb{R}^p} \rho_2(\hat\beta_{JS}, \beta, X) - \rho_2(\hat\beta_{ML}, \beta, X) > \epsilon \right) \to 1 \tag{8} $$

for some $\epsilon > 0$ (that is given explicitly in the proof). And if $t \leq 1/9$, then the expression on
the left-hand side of (8) converges to zero as $n \to \infty$ for each $\epsilon > 0$. Finally, irrespective
of the value of $t \in [0, 1)$, we have

$$ \sup_{\beta \in \mathbb{R}^p} P\left( \rho_2(\hat\beta_{JS}, \beta, X) - \rho_2(\hat\beta_{ML}, \beta, X) > \epsilon \right) \to 0 \tag{9} $$

for each $\epsilon > 0$.

More generally, consider tuning-parameters $c_n \geq 0$ that converge to a limit $c \in [0, \infty)$ as
$n \to \infty$, and consider the expression on the left-hand side of (8) with $\hat\beta(c_n)$ replacing
$\hat\beta_{JS}$. The resulting expression converges to one as $n \to \infty$ for some $\epsilon > 0$ in the
case where $0 < c < 2$ and $t > [(c-2)/(c+2)]^2$, and in the case where $c \geq 2$ and $t > 0$. In all other
cases, the resulting expression converges to zero for each $\epsilon > 0$. And (9) holds for each
$\epsilon > 0$ with $\hat\beta(c_n)$ replacing $\hat\beta_{JS}$ if $c \leq 2$, irrespective of $t$.
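The threshold in Theorem 4 can be explored numerically through the worst-case limit of Theorem 3. The sketch below assumes $R(\delta^2, c, t) = \frac{t}{1-t}(1 - c\,s)^2 + c^2 s^2 \delta^2 / (1-\sqrt{t})^2$ with $s = t/(t+\delta^2)$, and approximates the supremum over $\delta^2$ on a logarithmic grid; the names `big_r`, `threshold`, and `sup_excess`, as well as the grid itself, are hypothetical choices made only for illustration.

```python
import math

def big_r(delta2, c, t):
    """Worst-case limit R(delta^2, c, t) from Theorem 3, for 0 <= t < 1."""
    s = 0.0 if (t == 0.0 and delta2 == 0.0) else t / (t + delta2)
    return (t / (1.0 - t) * (1.0 - c * s) ** 2
            + c * c * s * s * delta2 / (1.0 - math.sqrt(t)) ** 2)

def threshold(c):
    """Critical value [(c - 2) / (c + 2)]^2 of t for a tuning parameter 0 < c < 2."""
    return ((c - 2.0) / (c + 2.0)) ** 2

def sup_excess(c, t, n_grid=2000):
    """Grid approximation to sup over delta^2 of R(delta^2, c, t) - t / (1 - t)."""
    base = t / (1.0 - t)
    best = -math.inf
    for i in range(n_grid + 1):
        delta2 = 10.0 ** (-4.0 + 10.0 * i / n_grid)  # log-spaced from 1e-4 to 1e6
        best = max(best, big_r(delta2, c, t) - base)
    return best

if __name__ == "__main__":
    print("threshold for c = 1:", threshold(1.0))
    for t in (0.05, 0.2, 0.5, 0.8):
        print(f"t = {t}: approximate worst-case excess = {sup_excess(1.0, t):.4f}")
```

For $c = 1$ (the limit of $(p-2)/p$ as $p \to \infty$), the grid maximum is positive when $t$ exceeds $1/9$ and non-positive on the grid when $t$ is below it, in line with the dichotomy stated in the theorem.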

Remark. In the setting of Theorem 4, it is easy to see that relation (8) holds uniformly over all pairs of $n$
and $p$ subject to $1/9 + \delta < p/n < 1 - \delta$; in addition, (9) holds uniformly over all such pairs
with $p/n < 1 - \delta$, for each $\delta > 0$, subject to $n > p \geq 3$ (and also uniformly over all positive
definite $p \times p$ matrices $\Sigma$). Similar statements also apply with $\hat\beta(c_n)$ replacing
$\hat\beta_{JS}$, mutatis mutandis.

Through relations (8) and (9), Theorem 4 provides two complementing views on the worst-case performance
of James-Stein-type shrinkage estimators. If the expression on the left-hand side of (8) is large, then
the James-Stein estimator $\hat\beta_{JS}$ is typically outperformed, from a worst-case perspective, by the
maximum-likelihood estimator $\hat\beta_{ML}$ (whose out-of-sample prediction error is constant in $\beta$).
Here, 'typically' means that for most realizations of the design matrix $X$ there is a parameter $\beta$ for
which $\hat\beta_{JS}$ performs worse than $\hat\beta_{ML}$. By Theorem 4, we see that this occurs with
probability approaching one in the statistically challenging case where $t > 1/9$ (while this occurs with
probability approaching zero in the case where $t < 1/9$). On the other hand, if the expression on the
left-hand side of (9) is small, then with high probability $X$ is such that $\hat\beta_{JS}$ outperforms
$\hat\beta_{ML}$, uniformly in $\beta$. In the setting of Theorem 4, the left-hand side of (9) always
converges to zero.

If $\rho_2(\cdot, \cdot, X)$ is used as a risk function, then (8) entails that $\hat\beta_{JS}$ is
outperformed by $\hat\beta_{ML}$ in terms of worst-case risk, for most realizations of the design matrix $X$.
This is a worst-case perspective, as is often adopted in frequentist statistical analyses. But the most
unfavorable parameter $\beta$, for which $\rho_2(\hat\beta_{JS}, \beta, X)$ is maximized, heavily depends on
$X$. In particular, the relation in (9) entails, for any fixed parameter $\beta$, that the probability that
$X$ is such that $\beta$ is unfavorable is small.



4 Discussion



We have derived explicit finite-sample formulae for the out-of-sample prediction error of the James-Stein
estimator and of related James-Stein-type shrinkage estimators in a linear regression model with Gaussian
errors and fixed design. In an example with a particular design matrix $X$, we have found that the James-Stein
estimator no longer dominates the maximum-likelihood estimator. We have shown that this phenomenon
generally occurs for most design matrices $X$ if the ratio of the number of explanatory variables in the model
($p$) and the sample size ($n$) exceeds 1/9, in the sense of statement (8) of Theorem 4. At the same time, we
have also shown that the James-Stein estimator outperforms the maximum-likelihood estimator for most
design matrices $X$, uniformly in the underlying parameters, in the sense of statement (9) of Theorem 4.
Our findings suggest that the James-Stein estimator can perform poorly for prediction out-of-sample from
a frequentist worst-case perspective. But our findings also suggest, in the setting considered here, that the
worst-case performance does not properly reflect the performance in the typical case, and that the
James-Stein estimator performs favorably compared to maximum-likelihood in the typical case, uniformly in the
underlying parameters.

The phenomenon that the James-Stein estimator dominates the maximum-likelihood estimator for in-sample
prediction but can fail to do so for prediction out-of-sample is linked to the fact that the eigenvalues
of $X'X/n$ can differ from the eigenvalues of $\Sigma$; see relations (2) and (3), as well as the remark
following Theorem 3. We therefore expect that other estimators that are designed to perform well for in-sample
prediction can exhibit similar phenomena when used for prediction out-of-sample. This includes various
shrinkage estimators like estimators based on model selection, penalized maximum-likelihood, and other
forms of regularization. Although beyond the scope of this paper, it would be particularly interesting to
study bridge estimators (Frank and Friedman, 1993), in particular, the LASSO (Tibshirani, 1996) and ridge
regression (Hoerl and Kennard, 1970), as well as the Dantzig selector (Candes and Tao, 2007) with regard
to their performance as predictors out-of-sample when $p/n$ is not small.



A Technical details for Section 2

Recall that $\hat\beta_{ML} \sim N(\beta, (X'X)^{-1})$, and define $Z$ and $\zeta$ as
$Z = (X'X)^{1/2}\hat\beta_{ML}$ and $\zeta = (X'X)^{1/2}\beta$, respectively, where $(X'X)^{1/2}$ denotes a
symmetric square root of $X'X$. Note that we have $Z \sim N(\zeta, I_p)$, that the unknown parameter
$\beta \in \mathbb{R}^p$ corresponds to the unknown parameter $\zeta \in \mathbb{R}^p$ via
$\beta = (X'X)^{-1/2}\zeta$, and that any estimator $\hat\beta$ for $\beta$ corresponds to an estimator
$\hat\zeta$ for $\zeta$ via the relation $\hat\beta = (X'X)^{-1/2}\hat\zeta$, and vice versa. In the Gaussian
location model $Z \sim N(\zeta, I_p)$ with $\zeta \in \mathbb{R}^p$, $p \geq 3$, consider the
James-Stein-type shrinkage estimator $\hat\zeta(c) = (1 - cp/Z'Z)Z$ with tuning parameter $c \geq 0$. The
estimator $\hat\zeta(c)$ for $\zeta$ corresponds to the estimator $(X'X)^{-1/2}\hat\zeta(c)$ for $\beta$, and
it is elementary to verify that $(X'X)^{-1/2}\hat\zeta(c)$ equals $\hat\beta(c)$.
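The elementary verification mentioned in the last sentence can be written out as the following worked step. It uses only the definition $Z = (X'X)^{1/2}\hat\beta_{ML}$ and the symmetry of the square root, so that $Z'Z = \hat\beta_{ML}'X'X\hat\beta_{ML}$; the notation $\hat\zeta(c)$ and $\hat\beta(c)$ for the two shrinkage estimators is assumed here.

$$
\begin{aligned}
(X'X)^{-1/2}\hat\zeta(c)
  &= \left(1 - \frac{cp}{Z'Z}\right)(X'X)^{-1/2} Z \\
  &= \left(1 - \frac{cp}{\hat\beta_{ML}'(X'X)^{1/2}(X'X)^{1/2}\hat\beta_{ML}}\right)(X'X)^{-1/2}(X'X)^{1/2}\hat\beta_{ML} \\
  &= \left(1 - \frac{cp}{\hat\beta_{ML}'X'X\hat\beta_{ML}}\right)\hat\beta_{ML}
   = \hat\beta(c).
\end{aligned}
$$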

Proof of Proposition 1. Define $Z$, $\zeta$ and $\hat\zeta(c)$ as in the preceding paragraph. For the
in-sample prediction error of the maximum-likelihood estimator, note that
$\rho_1(\hat\beta_{ML}, \beta, X) = (1/n) E[(Z - \zeta)'(Z - \zeta)] = p/n$. For relation (4), recall that
$\hat\zeta(c)$ satisfies
$E[(\hat\zeta(c) - \zeta)'(\hat\zeta(c) - \zeta)] = p - [2cp(p-2) - c^2 p^2] \, E[1/Z'Z]$; cf. James and
Stein (1961). To verify that this equality is equivalent to (4), first note that the expression on the
left-hand side of this equality satisfies
$E[(\hat\zeta(c) - \zeta)'(\hat\zeta(c) - \zeta)] = n \rho_1(\hat\beta(c), \beta, X)$ (by plugging the
definitions of $\hat\zeta(c)$, $Z$ and $\zeta$ into the left-hand side, and by recalling the definitions of
$\hat\beta(c)$ and $\rho_1(\hat\beta(c), \beta, X)$). And, in a similar fashion, the right-hand side of that
equality is equal to the expression on the right-hand side of (4), multiplied by $n$.

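For (5), we first use (3) together with the definition of $\hat\beta(c)$ to obtain that

$$ \rho_2(\hat\beta(c), \beta, X) = \rho_2(\hat\beta_{ML}, \beta, X) - 2cp \, E\left[ \frac{\hat\beta_{ML}' \Sigma (\hat\beta_{ML} - \beta)}{\hat\beta_{ML}' X'X \hat\beta_{ML}} \right] + c^2 p^2 \, E\left[ \frac{\hat\beta_{ML}' \Sigma \hat\beta_{ML}}{(\hat\beta_{ML}' X'X \hat\beta_{ML})^2} \right]. \tag{10} $$

The expected value in the second term on the right-hand side of (10) can be written as

$$ E\left[ \frac{Z'T(Z - \zeta)}{Z'Z} \right] = \sum_{i=1}^p E\left[ \frac{Z_i T_{i,i} (Z_i - \zeta_i)}{\sum_{k=1}^p Z_k^2} \right] + \sum_{\substack{i,j=1 \\ i \neq j}}^p E\left[ \frac{Z_j T_{j,i} (Z_i - \zeta_i)}{\sum_{k=1}^p Z_k^2} \right], \tag{11} $$

where $Z$ and $\zeta$ have been defined earlier, and where $T = (X'X)^{-1/2} \Sigma (X'X)^{-1/2}$. For each of
the terms in the first sum on the right-hand side of (11), we obtain that

$$ E\left[ \frac{Z_i T_{i,i} (Z_i - \zeta_i)}{\sum_{k=1}^p Z_k^2} \right] = E\left[ \frac{T_{i,i}}{\sum_{k=1}^p Z_k^2} \right] - 2 E\left[ \frac{T_{i,i} Z_i^2}{(\sum_{k=1}^p Z_k^2)^2} \right] $$

upon using Stein's Lemma conditional on the $Z_k$'s with $k \neq i$. And each of the terms in the second
(double) sum on the right-hand side of (11) can be written as

$$ E\left[ \frac{Z_j T_{j,i} (Z_i - \zeta_i)}{\sum_{k=1}^p Z_k^2} \right] = -2 E\left[ \frac{Z_j T_{j,i} Z_i}{(\sum_{k=1}^p Z_k^2)^2} \right], $$

again by using Stein's Lemma conditional on the $Z_k$'s with $k \neq i$. Using the preceding two equalities,
we can write the right-hand side of (11) as

$$ \mathrm{trace}(T) \, E\left[ \frac{1}{Z'Z} \right] - 2 E\left[ \frac{Z'TZ}{(Z'Z)^2} \right] = \mathrm{trace}(\Sigma (X'X)^{-1}) \, E\left[ \frac{1}{\hat\beta_{ML}' X'X \hat\beta_{ML}} \right] - 2 E\left[ \frac{\hat\beta_{ML}' \Sigma \hat\beta_{ML}}{(\hat\beta_{ML}' X'X \hat\beta_{ML})^2} \right], $$

where the equality is obtained by using the definitions of $T$ and $Z$. Recall that (11) or, equivalently, the
right-hand side of the preceding display, is equal to the expected value in the second term on the right-hand
side of (10). Plugging this into (10) and simplifying, we obtain (5). Furthermore,
$\rho_2(\hat\beta_{ML}, \beta, X) = \mathrm{trace}(\Sigma \, E[(\hat\beta_{ML} - \beta)(\hat\beta_{ML} - \beta)']) = \mathrm{trace}(\Sigma (X'X)^{-1})$. □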

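The next result shows how (4) and (5) depend on $\beta$ and on $X$; moreover, that result can also be used to
compute these expressions numerically.

Proposition A.1. Assume that the linear model in (1) holds and that $\Sigma$ is a positive definite
$p \times p$ matrix. Then the expected values in (4) and (5) can be written as

$$ E\left[ \frac{1}{\hat\beta_{ML}' X'X \hat\beta_{ML}} \right] = E\left[ \frac{1}{\chi^2_p(\beta'X'X\beta)} \right] $$

and

$$ E\left[ \frac{\hat\beta_{ML}' \Sigma \hat\beta_{ML}}{(\hat\beta_{ML}' X'X \hat\beta_{ML})^2} \right] = \mathrm{trace}(\Sigma (X'X)^{-1}) \, E\left[ \left( \frac{1}{\chi^2_{p+2}(\beta'X'X\beta)} \right)^2 \right] + \beta'\Sigma\beta \, E\left[ \left( \frac{1}{\chi^2_{p+4}(\beta'X'X\beta)} \right)^2 \right], $$

where, for each $k \geq 1$ and each $\lambda \geq 0$, the symbol $\chi^2_k(\lambda)$ denotes a random variable
that is chi-square distributed with $k$ degrees of freedom and non-centrality parameter $\lambda$.

Proof. The first equality follows upon recalling that $X\hat\beta_{ML}$ follows a Gaussian distribution with
mean $X\beta$ and covariance matrix $X(X'X)^{-1}X'$. The second equality is obtained by applying Corollary 2
from the appendix of Bock (1975). □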
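As a rough illustration of how Proposition A.1 can be used for numerical work, the sketch below evaluates the right-hand sides of (4) and (5) in pure Python. It assumes the standard Poisson-mixture representation of the non-central chi-square distribution (with non-centrality $\lambda$, the degrees of freedom are $k + 2J$ with $J$ Poisson with mean $\lambda/2$) and the central inverse moments $E[(1/\chi^2_d)^m] = 1/((d-2)(d-4)\cdots(d-2m))$ for $d > 2m$. All function names, and the truncation rule for the Poisson sum, are hypothetical choices.

```python
import math

def poisson_weights(mu):
    """Truncated Poisson(mu) probabilities, computed in log space for stability."""
    if mu == 0.0:
        return [1.0]
    upper = int(mu + 40.0 * math.sqrt(mu) + 100)
    return [math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))
            for j in range(upper + 1)]

def inv_moment(k, lam, m):
    """E[(1 / chi2_k(lam))^m] for integers k > 2m, via the Poisson mixture."""
    if k <= 2 * m:
        raise ValueError("requires k > 2m")
    total = 0.0
    for j, w in enumerate(poisson_weights(lam / 2.0)):
        d = k + 2 * j  # random degrees of freedom of the central chi-square
        central = 1.0
        for i in range(1, m + 1):
            central /= (d - 2 * i)
        total += w * central
    return total

def rho1_shrinkage(n, p, c, lam):
    """Right-hand side of (4); lam stands for beta' X'X beta."""
    return p / n - (2 * c * p * (p - 2) - c * c * p * p) * inv_moment(p, lam, 1) / n

def rho2_shrinkage(p, c, trace_term, lam, signal):
    """Right-hand side of (5); trace_term = trace(Sigma (X'X)^{-1}), signal = beta' Sigma beta."""
    e1 = inv_moment(p, lam, 1)
    e2 = (trace_term * inv_moment(p + 2, lam, 2)
          + signal * inv_moment(p + 4, lam, 2))
    return trace_term - 2 * c * p * trace_term * e1 + (c * c * p * p + 4 * c * p) * e2

if __name__ == "__main__":
    n, p, c, lam = 50, 10, 0.8, 7.5
    # Sanity check: with Sigma = X'X/n the two prediction errors must coincide.
    print(rho2_shrinkage(p, c, p / n, lam, lam / n), rho1_shrinkage(n, p, c, lam))
```

The consistency check in the usage example relies on the observation that, for $\Sigma = X'X/n$, definition (3) reduces to definition (2), so that (5) must reduce to (4).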
B Technical details for Section 3

We start with some auxiliary results, and then prove the results from Section 3.

Lemma B.1. For $t \geq 0$, $a = (1 - \sqrt{t})^2$ and $b = (1 + \sqrt{t})^2$, we have

$$ \int_a^b \frac{1}{x^2} \sqrt{(b - x)(x - a)} \, dx = 2\pi \, \frac{\min\{1, t\}}{|1 - t|}, $$

where the expression on the right-hand side of the preceding display is to be interpreted as $\infty$ in case $t = 1$.
Proof. The case $t = 0$ is trivial. In the case where $t > 0$, the integral of interest can be written as

$$ \int_a^b \frac{1}{x^2} \sqrt{(b - x)(x - a)} \, dx = 2(b - a)^2 \int_0^\infty \frac{z^2}{(az^2 + b)^2 (z^2 + 1)} \, dz = 2(b - a)^2 \int_0^\infty \left[ \frac{a}{(b - a)^2} \frac{1}{az^2 + b} + \frac{b}{b - a} \frac{1}{(az^2 + b)^2} - \frac{1}{(b - a)^2} \frac{1}{z^2 + 1} \right] dz, $$

where the first equality is obtained by the substitution $z = \sqrt{(b - x)/(x - a)}$, and the second equality
follows from a partial fraction decomposition. Recall that $\int_0^w (u^2 + 1)^{-1} du = \arctan(w)$ and that
$2k \int_0^w (u^2 + 1)^{-(1+k)} du = w(w^2 + 1)^{-k} + (2k - 1) \int_0^w (u^2 + 1)^{-k} du$ whenever $k > 0$.
With this, and for the case where $t \neq 1$, it follows that the integral on the far right-hand side of the
preceding display reduces to

$$ 2(b - a)^2 \left[ \frac{a}{(b - a)^2} \frac{\pi}{2\sqrt{a}\sqrt{b}} + \frac{b}{b - a} \frac{\pi}{4\sqrt{a}\sqrt{b}^3} - \frac{1}{(b - a)^2} \frac{\pi}{2} \right] = \pi \, \frac{(\sqrt{b} - \sqrt{a})^2}{2\sqrt{a}\sqrt{b}}. $$

Now note that $\sqrt{a} = 1 - \sqrt{t}$ in case $t < 1$, while $\sqrt{a} = \sqrt{t} - 1$ in case $t > 1$. This
implies that the expression on the far right-hand side of the preceding display reduces to
$2\pi \min\{1, t\}/|1 - t|$, as required. And in case $t = 1$, we have $a = 0$ and $b = 4$, so that the
integrand on the far right-hand side of the second-to-last display reduces to $[1 - 1/(z^2 + 1)]/16$ and hence
integrates to $\infty$. □
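The closed form in Lemma B.1 can be checked numerically. The sketch below is only an illustration: it assumes the integral in question is $\int_a^b x^{-2}\sqrt{(b-x)(x-a)}\,dx$ and evaluates it with a midpoint rule after the substitution $x = (1+t) + 2\sqrt{t}\cos\theta$, which removes the square-root endpoints; the substitution, the number of nodes, and the function names are choices made here and do not appear in the proof.

```python
import math

def mp_integral(t, n_nodes=20000):
    """Midpoint-rule value of the integral of sqrt((b-x)(x-a)) / x^2 over [a, b]."""
    if t <= 0.0 or t == 1.0:
        raise ValueError("requires t > 0 and t != 1")
    mid = 1.0 + t               # (a + b) / 2 with a = (1-sqrt(t))^2, b = (1+sqrt(t))^2
    half = 2.0 * math.sqrt(t)   # (b - a) / 2
    step = math.pi / n_nodes
    total = 0.0
    for k in range(n_nodes):
        theta = (k + 0.5) * step
        x = mid + half * math.cos(theta)
        # sqrt((b-x)(x-a)) = half*sin(theta) and |dx| = half*sin(theta) d(theta)
        total += (half * math.sin(theta)) ** 2 / (x * x)
    return total * step

def closed_form(t):
    """Right-hand side of Lemma B.1 for t != 1."""
    return 2.0 * math.pi * min(1.0, t) / abs(1.0 - t)

if __name__ == "__main__":
    for t in (0.25, 0.8, 4.0):
        print(t, mp_integral(t), closed_form(t))
```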

Lemma B.2. For the $n \times p$ matrix $V$ as in Section 3, write
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ for the ordered eigenvalues of $V'V$. If $n$ and $p$ are
such that $n \to \infty$ and $p/n \to t$ for some $t \in [0, 1]$, then

$$ \sum_{i=1}^p \frac{1}{\lambda_i} \to \frac{t}{1 - t} \quad \text{almost surely}, $$

where the limit is to be interpreted as $\infty$ in the case where $t = 1$ (and where the sum on the left-hand
side is to be interpreted as $\infty$ whenever $\lambda_p = 0$). Moreover, we also have
$\lambda_1/n \to (1 + \sqrt{t})^2$ and $\lambda_p/n \to (1 - \sqrt{t})^2$ almost surely.

Proof. Consider first the case where t satisfies < t < 1. Write v\ > V2 > ■ ■ ■ > v p for the ordered eigenvalues 
of V'V/n, and note that z/j = Xi/n, 1 < i < p. Write F n for the empirical cumulative distribution function 
(c.d.f.) of the z/j's, i.e., for F n (x) is the fraction of eigenvalues of V'V/n that do not exceed x, and note 

that F n is a random c.d.f. as it depends on V'V/ n. Under the maintained assumptions, F n converges weakly to 
a non-random limit F, except on a set of probability zero; cf. Theorem 3.6 of |Bai and Silverstein| ( [2010[ ). The 



limit c.d.f. F corresponds to the so-called Marchenko-Pastur distribution, which is supported by the interval 
[a, 6] with a = (1 — \/i) 2 and 6 = (1 + \fi) 2 , and which has a density given by f(x) = y/(x — a) (6 — x)/ (2iTtx) 
for a < x < b. For any bounded continuous function g(-), it follows that J g(x)F n (dx) — > J g(x)f(x)dx 
almost surely in view of the Portmanteau Theorem. And under the maintained assumptions, we also see 
that the smallest and the largest eigenvalue of V'V/n, i.e., v p = X p /n and v\ = X\/n, converge almost surely 
to a and to 6, respectively, whenever < t < 1; see Theorem 5.11 of |Bai and Silverstein ( 2010[ ). Recall 



that a = (1 — \/t) 2 is positive, and define a function g(-) as g(x) — 1/x if x > a/2 and set g(x) = 2/a 
otherwise. Now first note that J g(x)F n (dx) — > j g(x)f(x)dx, except on a probability zero event, because 
g(-) is continuous and bounded. Second, we have that J g(x)f(x)dx = f(l/x)f(x)dx because g(x) = 1/x 
on the support of /(•). And third, observe that J g(x)F n (dx) and f (l/x)F n (dx) converge to the same limit, 



<S 



except on a probability zero event, because v v converges to a almost surely. These three observations imply 
that 

-F n (dx) ^ f -f(x)dx. 

X J X 

But the left-hand side in the preceding display can also be written as |X)i=i u- = j> Sf=i X7 ' ano - ^he 

It follows that X)i=i 1A< 

converges to t/(l — 



i) in view of Lemma 



B.l 



right-hand side is equal to 1/(1 
almost surely, as required. 

In the case where $t = 0$, choose an arbitrary number $\bar t$ with $0 < \bar t < 1$, and choose numbers $\bar p \ge p$ that go to infinity with $n$ such that $\bar p/n \to \bar t$ as $n \to \infty$. Write $\bar V$ for the $n \times \bar p$ matrix that forms the upper left block of the double array $(V_{i,j})_{i \ge 1, j \ge 1}$, and denote the eigenvalues of $\bar V'\bar V$ by $\bar\lambda_1 \ge \dots \ge \bar\lambda_{\bar p}$. Because $V$ is a sub-matrix of $\bar V$, Cauchy's interlacing theorem entails that $\sum_{i=1}^{p} 1/\lambda_i \le \sum_{i=1}^{\bar p} 1/\bar\lambda_i$, and also that $\bar\lambda_{\bar p} \le \lambda_p \le \lambda_1 \le \bar\lambda_1$. And from the case considered in the preceding paragraph, it follows that $\sum_{i=1}^{\bar p} 1/\bar\lambda_i \to \bar t/(1-\bar t)$, that $\bar\lambda_{\bar p}/n \to (1-\sqrt{\bar t})^2$, and that $\bar\lambda_1/n \to (1+\sqrt{\bar t})^2$, almost surely. Taken together, we obtain that $\limsup \sum_{i=1}^{p} 1/\lambda_i \le \bar t/(1-\bar t)$, that $(1-\sqrt{\bar t})^2 \le \liminf \lambda_p/n$, and that $\limsup \lambda_1/n \le (1+\sqrt{\bar t})^2$, almost surely. Letting $\bar t \downarrow 0$ gives the desired result.

In the case where $t = 1$, we already know that $\lambda_1/n \to 4$ and $\lambda_p/n \to 0$ almost surely. Choose an arbitrary number $\bar t$ with $0 < \bar t < 1$, and let $\bar p \le p$ be such that $\bar p/n \to \bar t$. Now repeat the argument in the preceding paragraph, with the role of $p$ and $\bar p$ exchanged, to conclude that $\bar t/(1-\bar t) \le \liminf \sum_{i=1}^{p} 1/\lambda_i$ almost surely. The result follows by letting $\bar t \uparrow 1$. □
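The value $1/(1-t)$ of the limiting integral used in the proof above can be checked numerically. The following is a minimal sketch, assuming that $f$ is the Marchenko-Pastur density with ratio $t$ and unit variance, $f(x) = \sqrt{(b-x)(x-a)}/(2\pi t x)$ on $[a,b]$ with $a = (1-\sqrt{t})^2$ and $b = (1+\sqrt{t})^2$; the function name `mp_inverse_moment` is a hypothetical helper, not part of the paper.

```python
import math


def mp_inverse_moment(t, steps=2000):
    """Approximate the integral of (1/x) f(x) dx over the support of f.

    Assumption: f is the Marchenko-Pastur density with ratio t in (0, 1)
    and unit variance, f(x) = sqrt((b - x)(x - a)) / (2 pi t x) on [a, b].
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    centre = 1.0 + t               # midpoint of the support [a, b]
    radius = 2.0 * math.sqrt(t)    # half the length of the support
    h = math.pi / steps
    total = 0.0
    # The substitution x = centre + radius * cos(theta) removes the
    # square-root behaviour at the endpoints; the integrand becomes
    # (2 / pi) * sin(theta)^2 / x^2 on [0, pi].
    for j in range(steps):
        theta = (j + 0.5) * h      # midpoint rule in theta
        x = centre + radius * math.cos(theta)
        total += math.sin(theta) ** 2 / (x * x)
    return (2.0 / math.pi) * total * h


if __name__ == "__main__":
    for t in (0.1, 0.5, 0.9):
        # Second and third column should agree closely.
        print(t, mp_inverse_moment(t), 1.0 / (1.0 - t))
```

Multiplying the result by $t$ gives the limit $t/(1-t)$ of $\sum_{i=1}^p 1/\lambda_i$ stated in the lemma.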



Remark. The lemma continues to hold if $\sum_{i=1}^p 1/\lambda_i$ is replaced by $\sum_{i:\lambda_i > 0} 1/\lambda_i$. And the lemma can be adapted to cover also the case where $t \in (1,\infty]$.



Throughout the following, we use symbols like $\chi^2_k(\lambda)$ to denote a random variable that is chi-square distributed with $k \ge 1$ degrees of freedom and non-centrality parameter $\lambda \ge 0$. Note that $\chi^2_k(\lambda)$ has the same distribution as the sum of $k + 2J_\lambda$ i.i.d. central chi-square distributed random variables with one degree of freedom that are also independent of $J_\lambda$, where $J_\lambda$ is Poisson with mean $\lambda/2$. In that sense, the law of $\chi^2_k(\lambda)$ can be viewed as a central chi-square distribution with random degrees of freedom equal to $k + 2J_\lambda$, and we will also denote this distribution by $\chi^2_{k+2J_\lambda}$. From this, it follows that $\chi^2_k(\lambda)$ is stochastically larger than $\chi^2_k(0)$, whence $\mathbb{E}[(1/\chi^2_k(\lambda))^m] \le \mathbb{E}[(1/\chi^2_k(0))^m]$ for each $m \ge 1$ (where the expected values can be infinite). Also note that $\mathbb{E}[(1/\chi^2_k(0))^m] = \prod_{i=1}^m \frac{1}{k-2i}$ whenever $k$ and $m$ are positive integers satisfying $k > 2m$.



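Lemma B.3. Fix an integer $m \ge 1$, and consider random variables $\chi^2_k(\lambda)$ with $k > 2m$. Then

$$\mathbb{E}\left[\left(\frac{k+\lambda}{\chi^2_k(\lambda)}\right)^m\right] \to 1 \quad \text{as } k + \lambda \to \infty.$$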
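The Poisson-mixture representation and the closed form for central inverse moments described above make the expectation in Lemma B.3 easy to evaluate numerically. The sketch below is illustrative only; the function names and the truncation window for the Poisson sum are assumptions made here, not part of the paper.

```python
import math


def central_inverse_moment(k, m):
    """E[(1/chi2_k(0))^m] = prod_{i=1}^m 1/(k - 2i), for integers k > 2m."""
    if k <= 2 * m:
        raise ValueError("need k > 2m for a finite moment")
    value = 1.0
    for i in range(1, m + 1):
        value /= (k - 2 * i)
    return value


def scaled_inverse_moment(k, lam, m):
    """E[((k + lam) / chi2_k(lam))^m], using chi2_k(lam) ~ chi2_{k + 2J}(0)
    with J Poisson with mean lam / 2."""
    if k <= 2 * m:
        raise ValueError("need k > 2m for a finite moment")
    mean = lam / 2.0
    if mean == 0.0:
        return (k + lam) ** m * central_inverse_moment(k, m)
    # Truncate the Poisson sum to a wide window around its mean (assumption:
    # the mass outside mean +/- (12 sd + 40) is negligible).
    width = 12.0 * math.sqrt(mean) + 40.0
    lo = max(0, int(mean - width))
    hi = int(mean + width) + 1
    total = 0.0
    for j in range(lo, hi + 1):
        log_pmf = -mean + j * math.log(mean) - math.lgamma(j + 1)
        total += math.exp(log_pmf) * central_inverse_moment(k + 2 * j, m)
    return (k + lam) ** m * total


if __name__ == "__main__":
    # The scaled moment approaches 1 as k + lam grows, in either argument.
    for k, lam in ((5, 0.0), (50, 0.0), (5, 500.0), (2000, 3000.0)):
        print(k, lam, scaled_inverse_moment(k, lam, 2))
```

For small $k + \lambda$ the scaled moment can be far from one (for instance $25/3$ at $k = 5$, $\lambda = 0$, $m = 2$), which is why the lemma is an asymptotic statement.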



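Proof. Fix $m \ge 1$. The mean and the variance of $\chi^2_k(\lambda)$ are given by $k + \lambda$ and $2(k + 2\lambda)$, respectively. Using this fact and Chebyshev's inequality, we see that $\chi^2_k(\lambda)/(k+\lambda) \to 1$, and also that $[(k+\lambda)/\chi^2_k(\lambda)]^m \to 1$, in probability as $k + \lambda \to \infty$. It remains to show, for integers $k_\alpha$ and reals $\lambda_\alpha$, $\alpha \ge 1$, satisfying $k_\alpha > 2m$, $\lambda_\alpha \ge 0$, and $k_\alpha + \lambda_\alpha \to \infty$ as $\alpha \to \infty$, that the random variables $[(k_\alpha + \lambda_\alpha)/\chi^2_{k_\alpha}(\lambda_\alpha)]^m$ are uniformly integrable. In other words, for each fixed $\epsilon > 0$, we need to find a constant $M$ so that

$$\mathbb{E}\left[\left(\frac{k_\alpha + \lambda_\alpha}{\chi^2_{k_\alpha}(\lambda_\alpha)}\right)^m \left\{\left(\frac{k_\alpha + \lambda_\alpha}{\chi^2_{k_\alpha}(\lambda_\alpha)}\right)^m > M\right\}\right] < \epsilon \tag{12}$$

holds for each $\alpha$, where we use symbols like $\{A\}$ to denote the indicator function of the event $A$. Set $U_\alpha = [(k_\alpha + \lambda_\alpha)/\chi^2_{k_\alpha}(\lambda_\alpha)]^m$, note that the $U_\alpha$'s are all integrable because $k_\alpha > 2m$, and recall that $\chi^2_{k_\alpha}(\lambda_\alpha)$ is distributed as $\chi^2_{k_\alpha + 2J_{\lambda_\alpha}}(0)$ for an independent Poisson random variable $J_{\lambda_\alpha}$ with mean $\lambda_\alpha/2$. For later use, we also recall the Chernoff bound $\mathbb{P}(J_{\lambda_\alpha} \le \lambda_\alpha/4) \le (e/2)^{-\lambda_\alpha/4}$. To derive (12), i.e., to show uniform integrability of the $U_\alpha$'s, we may assume that either $k_\alpha \to \infty$ or $\lambda_\alpha \to \infty$ as $\alpha \to \infty$, by switching to subsequences if necessary (because $k_\alpha + \lambda_\alpha \to \infty$).

Assume that $k_\alpha \to \infty$. Here, we first consider the (finitely many) $\alpha$'s for which $k_\alpha \le 4m$. For these, we can find a constant $M_1$ so that (12) holds whenever $M > M_1$ because the $U_\alpha$'s are integrable. It remains to consider the $\alpha$'s with $k_\alpha > 4m$. For these $\alpha$'s, we derive uniform integrability by showing that the second moments of the $U_\alpha$'s are bounded. Using the law of iterated expectation, $\mathbb{E}[U_\alpha^2]$ is given by the sum of

$$\mathbb{E}\left[\prod_{i=1}^{2m} \frac{k_\alpha + \lambda_\alpha}{k_\alpha - 2i + 2J_{\lambda_\alpha}} \left\{J_{\lambda_\alpha} > \lambda_\alpha/4\right\}\right] \tag{13}$$

and

$$\mathbb{E}\left[\prod_{i=1}^{2m} \frac{k_\alpha + \lambda_\alpha}{k_\alpha - 2i + 2J_{\lambda_\alpha}} \left\{J_{\lambda_\alpha} \le \lambda_\alpha/4\right\}\right]. \tag{14}$$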



The expression in (13) is bounded by $(4m+3)^{2m}$ because the $i$-th fraction in the integrand of (13) is bounded by $(k_\alpha + \lambda_\alpha)/(k_\alpha - 2i + \lambda_\alpha/2) \le (k_\alpha + \lambda_\alpha)/(k_\alpha - 4m + \lambda_\alpha/2) \le k_\alpha/(k_\alpha - 4m) + 2 \le 4m + 1 + 2$ (the first inequality holds in view of $J_{\lambda_\alpha} > \lambda_\alpha/4$, and the last inequality holds because $k_\alpha > 4m$). To bound the expression in (14), note that the $i$-th fraction in the integrand in (14) is bounded by $k_\alpha/(k_\alpha - 2i) + \lambda_\alpha \le 4m + 1 + \lambda_\alpha$. Using this and Chernoff's Poisson tail bound mentioned earlier, it follows that (14) is bounded by $(4m + 1 + \lambda_\alpha)^{2m} (e/2)^{-\lambda_\alpha/4}$. This upper bound is a continuous function of $\lambda_\alpha \ge 0$ that converges to zero as $\lambda_\alpha \to \infty$. Hence, the upper bound is itself bounded by a constant, uniformly in $\lambda_\alpha \ge 0$, as required.

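Now assume that $\lambda_\alpha \to \infty$. We decompose the expression on the left-hand side of (12) into the sum of $\mathbb{E}[U_\alpha\{U_\alpha > M\}\{J_{\lambda_\alpha} \le \lambda_\alpha/4\}]$ and $\mathbb{E}[U_\alpha\{U_\alpha > M\}\{J_{\lambda_\alpha} > \lambda_\alpha/4\}]$, and find $M$ so that each of these two terms is bounded by $\epsilon/2$. To this end, note that $\chi^2_{k_\alpha}(\lambda_\alpha)$ is stochastically larger than $\chi^2_{k_\alpha}(0)$, so that $\mathbb{E}[U_\alpha\{U_\alpha > M\}\{J_{\lambda_\alpha} \le \lambda_\alpha/4\}]$ can be bounded by

$$\mathbb{E}\left[\left(\frac{k_\alpha + \lambda_\alpha}{\chi^2_{k_\alpha}(0)}\right)^m\right] \mathbb{P}\left(J_{\lambda_\alpha} \le \lambda_\alpha/4\right) \le (2m + 1 + \lambda_\alpha)^m (e/2)^{-\lambda_\alpha/4}.$$

In the preceding display, the inequality is derived by writing the expected value as $\prod_{i=1}^m (k_\alpha + \lambda_\alpha)/(k_\alpha - 2i)$, by noting that $k_\alpha/(k_\alpha - 2i) \le 2m + 1$ because $k_\alpha > 2m$, and upon using Chernoff's tail bound for the Poisson. Since $\lambda_\alpha \to \infty$ here, the upper bound in the preceding display is smaller than $\epsilon/2$ for sufficiently large $\lambda_\alpha$'s, e.g., $\lambda_\alpha > \lambda^*$. And for the (finitely many) $\alpha$'s for which $\lambda_\alpha \le \lambda^*$, we can find a constant $M_2$ so that $\mathbb{E}[U_\alpha\{U_\alpha > M\}\{J_{\lambda_\alpha} \le \lambda_\alpha/4\}]$ is less than $\epsilon/2$ whenever $M > M_2$. Lastly, $\mathbb{E}[U_\alpha\{U_\alpha > M\}\{J_{\lambda_\alpha} > \lambda_\alpha/4\}]$ is bounded by

$$\mathbb{E}\left[\left(\frac{k_\alpha + \lambda_\alpha}{\chi^2_{k_\alpha + \lceil\lambda_\alpha/2\rceil}(0)}\right)^m \left\{\left(\frac{k_\alpha + \lambda_\alpha}{\chi^2_{k_\alpha + \lceil\lambda_\alpha/2\rceil}(0)}\right)^m > M\right\}\right] \le 2^m\, \mathbb{E}\left[\left(\frac{k_\alpha + \lceil\lambda_\alpha/2\rceil}{\chi^2_{k_\alpha + \lceil\lambda_\alpha/2\rceil}(0)}\right)^m \left\{\left(\frac{k_\alpha + \lceil\lambda_\alpha/2\rceil}{\chi^2_{k_\alpha + \lceil\lambda_\alpha/2\rceil}(0)}\right)^m > 2^{-m} M\right\}\right],$$

where $\lceil x \rceil$ denotes the smallest integer not smaller than $x$, and where the inequality is based on the observation that $k_\alpha + \lambda_\alpha = 2(k_\alpha/2 + \lambda_\alpha/2) \le 2(k_\alpha + \lceil\lambda_\alpha/2\rceil)$. Set $\bar k_\alpha = k_\alpha + \lceil\lambda_\alpha/2\rceil$ and set $\bar\lambda_\alpha = 0$. To find an $M_3 \ge M_2$ such that the expression on the right-hand side of the preceding display is less than $\epsilon/2$ whenever $M > M_3$, it suffices to show that the random variables $[(\bar k_\alpha + \bar\lambda_\alpha)/\chi^2_{\bar k_\alpha}(\bar\lambda_\alpha)]^m$ are uniformly integrable. Since $\bar k_\alpha \to \infty$ as $\alpha \to \infty$, this has already been established in the second paragraph of the proof. □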
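The Chernoff bound for the Poisson lower tail that is used repeatedly in the proof above, namely that a Poisson variable with mean $\lambda/2$ is at most $\lambda/4$ with probability at most $(e/2)^{-\lambda/4}$, can be checked directly. The following is a small sketch; the helper names are hypothetical and the grid of $\lambda$ values is an arbitrary choice.

```python
import math


def poisson_cdf(x, mean):
    """P(J <= x) for J ~ Poisson(mean), with each term computed in log-space."""
    if x < 0:
        return 0.0
    if mean == 0.0:
        return 1.0
    total = 0.0
    for j in range(int(math.floor(x)) + 1):
        total += math.exp(-mean + j * math.log(mean) - math.lgamma(j + 1))
    return total


def chernoff_bound(lam):
    """The bound (e/2)^(-lam/4) on P(J <= lam/4) when J has mean lam/2."""
    return (math.e / 2.0) ** (-lam / 4.0)


if __name__ == "__main__":
    for lam in (0.5, 2.0, 10.0, 50.0, 200.0):
        tail = poisson_cdf(lam / 4.0, lam / 2.0)
        print(lam, tail, chernoff_bound(lam), tail <= chernoff_bound(lam))
```

Both columns decay exponentially in $\lambda$, which is what allows the polynomial factors $(4m + 1 + \lambda)^{2m}$ and $(2m + 1 + \lambda)^m$ in the proof to be absorbed.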



Lemma B.4. Consider the $n \times p$ matrix $V$ as in Section^ and let $w$ be a unit-vector in $\mathbb{R}^p$ ($n > p \ge 3$). Then $\mathbb{E}[w'V'Vw/n] = 1$ and $\mathrm{Var}[w'V'Vw/n] \to 0$ as $n \to \infty$, irrespective of the behavior of $p$; in particular, we have $w'V'Vw/n \to 1$ in probability.



Proof. Setting $W_i^{(n)} = (\sum_{j=1}^p V_{i,j} w_j)^2$, we see that $w'V'Vw/n$ can be written as $(1/n)\sum_{i=1}^n W_i^{(n)}$, i.e., as the average of $n$ i.i.d. random variables which have mean $\sum_{j=1}^p w_j^2 = 1$. And the variance of $W_i^{(n)}$ is given by

$$\sum_{a,b,c,d=1}^p w_a w_b w_c w_d\, \mathbb{E}[V_{i,a} V_{i,b} V_{i,c} V_{i,d}] - 1 = \mathbb{E}[V_{1,1}^4] \sum_{a=1}^p w_a^4 + 3 \sum_{\substack{a,b=1 \\ a \ne b}}^p w_a^2 w_b^2 - 1$$

$$= \left(\mathbb{E}[V_{1,1}^4] - 1\right) \sum_{a=1}^p w_a^4 + 2 \sum_{\substack{a,b=1 \\ a \ne b}}^p w_a^2 w_b^2 \le \mathbb{E}[V_{1,1}^4] + 1.$$

In the preceding display, the first equality is based on the fact that $\mathbb{E}[V_{i,a} V_{i,b} V_{i,c} V_{i,d}] = 0$ whenever one of the indices $a, b, c, d$ differs from the others, while the second equality and the inequality are derived from the facts that $\sum_{a=1}^p w_a^2 = 1$ and that $1 \le \mathbb{E}[V_{1,1}^4]$. We hence can bound $\mathrm{Var}[w'V'Vw/n]$ by $(\mathbb{E}[V_{1,1}^4] + 1)/n$. □
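The variance computation in the proof of Lemma B.4 can be reproduced for any concrete unit vector. The sketch below assumes, as the proof does implicitly, that the entries of $V$ are i.i.d. with mean zero, unit variance, and finite fourth moment; the function name and the example vectors are illustrative choices.

```python
def quadratic_form_variance(w, fourth_moment):
    """Variance of (sum_j V_j w_j)^2 for i.i.d. V_j with mean 0, variance 1
    and E[V_j^4] = fourth_moment, where w is a unit vector.

    Uses E[V^4] * sum_a w_a^4 + 3 * sum_{a != b} w_a^2 w_b^2 - 1.
    """
    s2 = sum(x * x for x in w)
    if abs(s2 - 1.0) > 1e-9:
        raise ValueError("w must be a unit vector")
    s4 = sum(x ** 4 for x in w)
    cross = s2 * s2 - s4           # sum over a != b of w_a^2 w_b^2
    return fourth_moment * s4 + 3.0 * cross - 1.0


if __name__ == "__main__":
    gaussian_w = [0.5, 0.5, 0.5, 0.5]
    # Gaussian entries (fourth moment 3): the square is chi-square with one
    # degree of freedom, so the variance is 2 for every unit vector.
    print(quadratic_form_variance(gaussian_w, 3.0))
    # Rademacher entries (fourth moment 1) and a coordinate vector: the
    # square is constant, so the variance is 0.
    print(quadratic_form_variance([1.0, 0.0, 0.0], 1.0))
```

In both examples the value stays below the bound $\mathbb{E}[V_{1,1}^4] + 1$ from the display, and dividing by $n$ gives the bound on $\mathrm{Var}[w'V'Vw/n]$.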

Lemma B.5. Assume that Theorem^ applies. If $t + \delta^2 > 0$, then

$$\frac{p}{p + \beta'X'X\beta} \to \frac{t}{t + \delta^2} \quad\text{and}\quad \frac{p\,\beta'\Sigma\beta}{p + \beta'X'X\beta} \to \frac{t\,\delta^2}{t + \delta^2}$$

in probability as $n \to \infty$; in case $\delta^2 = \infty$, the two limits are to be interpreted as $0$ and as $t$, respectively. These statements continue to hold if $p$ is replaced by $p + k$ in the numerators of the preceding display, for some fixed $k \in \mathbb{N}$.



Proof. Set $w = \Sigma^{1/2}\beta/(\beta'\Sigma\beta)^{1/2}$, note that $\|w\| = 1$, and that $\beta'X'X\beta/n$ can be written as $(\beta'\Sigma\beta)\, w'V'Vw/n$. Using Lemma B.4, it follows that $\beta'X'X\beta/n \to \delta^2$ in probability. Because $\beta'X'X\beta/n$ and $\beta'\Sigma\beta$ both converge to $\delta^2$ (the former in probability), and because $p/n \to t$, the continuous mapping theorem gives both limits in case $\delta^2 < \infty$, and the first limit in case $\delta^2 = \infty$. In the remaining case where $\delta^2 = \infty$, write the quantity of interest as $(p/n)/[(p/n)/\beta'\Sigma\beta + w'V'Vw/n]$, which is easily seen to converge to $t$, as claimed. The last statement is trivial because $p/n$ and $(p+k)/n$ converge to the same limit. □



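Proof of Theorem 2. For the maximum-likelihood estimator, we have $\rho_2(\hat\beta_{ML}, \beta, X) = \operatorname{trace}(\Sigma (X'X)^{-1}) = \operatorname{trace}((V'V)^{-1})$ almost surely, so that $\rho_2(\hat\beta_{ML}, \beta, X) \to t/(1-t)$ in probability by Lemma B.2. For $\hat\beta(c)$, note that $\rho_2(\hat\beta(c), \beta, X)$ is given by (pf almost surely (where the expected values are to be understood as conditional on $X$, denoted by $\mathbb{E}[\,\cdot\,\|X]$ in the following); using Proposition A.1 conditional on $X$, we can thus write $\rho_2(\hat\beta(c), \beta, X)$ also as

$$\operatorname{trace}((V'V)^{-1}) \left(1 - 2c\, \mathbb{E}\left[\frac{p}{\chi^2_p(\beta'X'X\beta)} \,\Big\|\, X\right] + (c^2 + 4c/p)\, \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+2}(\beta'X'X\beta))^2} \,\Big\|\, X\right]\right) + (c^2 + 4c/p)\, \mathbb{E}\left[\frac{p^2\, \beta'\Sigma\beta}{(\chi^2_{p+4}(\beta'X'X\beta))^2} \,\Big\|\, X\right] \tag{15}$$

almost surely. Our first goal is to show that $\sup_{0 \le c \le C} |\rho_2(\hat\beta(c), \beta, X) - r(\delta^2, c, t)| = o_P(1)$.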



In the case where $t + \delta^2 = 0$, note that $r(0, c, 0) = 0$, so that $\sup_{0 \le c \le C} |\rho_2(\hat\beta(c), \beta, X) - r(\delta^2, c, t)|$ is bounded from above by

$$\operatorname{trace}((V'V)^{-1}) \left(1 + 2C\, \mathbb{E}\left[\frac{p}{\chi^2_p(0)}\right] + (C^2 + 4C/p)\, \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+2}(0))^2}\right]\right) + (C^2 + 4C/p)\, \beta'\Sigma\beta\; \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+4}(0))^2}\right]$$

because $\chi^2_{p+k}(\beta'X'X\beta)$ is stochastically larger than $\chi^2_{p+k}(0)$ for each $k \ge 0$. Because both $\operatorname{trace}((V'V)^{-1})$ and $\beta'\Sigma\beta$ converge to zero (the former in probability) and the expected values are bounded by 3 for each $p \ge 3$, it follows that the expression in the preceding display converges to zero in probability.

In the case where $t + \delta^2 > 0$, we use the formula for $\rho_2(\hat\beta(c), \beta, X)$ in (15), the formula for $r(\delta^2, c, t)$, and the triangle inequality to bound $\sup_{0 \le c \le C} |\rho_2(\hat\beta(c), \beta, X) - r(\delta^2, c, t)|$ from above by



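$$\left|\operatorname{trace}((V'V)^{-1}) - \frac{t}{1-t}\right| + 2C \left|\operatorname{trace}((V'V)^{-1})\, \mathbb{E}\left[\frac{p}{\chi^2_p(\beta'X'X\beta)} \,\Big\|\, X\right] - \frac{t^2}{(1-t)(t+\delta^2)}\right|$$

$$+\; C^2 \left|\operatorname{trace}((V'V)^{-1})\, \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+2}(\beta'X'X\beta))^2} \,\Big\|\, X\right] - \frac{t^3}{(1-t)(t+\delta^2)^2}\right| + C^2 \left|\mathbb{E}\left[\frac{p^2\, \beta'\Sigma\beta}{(\chi^2_{p+4}(\beta'X'X\beta))^2} \,\Big\|\, X\right] - \frac{t^2\delta^2}{(t+\delta^2)^2}\right|$$

$$+\; 4(C/p) \operatorname{trace}((V'V)^{-1})\, \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+2}(\beta'X'X\beta))^2} \,\Big\|\, X\right].$$

We already know that the first term of the preceding display converges to zero in probability. For the remaining four terms, we note that either $p \to \infty$ (in case $t > 0$), or $\beta'X'X\beta \to \infty$ in probability (in case $\delta^2 > 0$, in view of Lemma B.4, because $n\,\beta'X'X\beta/n$ can be written as $n\,\beta'\Sigma\beta\,(w'V'Vw/n)$, where the unit vector $w$ is given by $\Sigma^{1/2}\beta/(\beta'\Sigma\beta)^{1/2}$). Using the assumption that $p \ge 3$, Lemma B.3 can be used to deal with the expected values in the preceding display. Together with the fact that $\operatorname{trace}((V'V)^{-1}) \to t/(1-t)$ in probability, this shows that the sum of the four remaining terms converges to the same limit as

$$2C\, \frac{t}{1-t} \left|\frac{p}{p + \beta'X'X\beta} - \frac{t}{t+\delta^2}\right| + C^2\, \frac{t}{1-t} \left|\frac{p^2}{(p + 2 + \beta'X'X\beta)^2} - \frac{t^2}{(t+\delta^2)^2}\right|$$

$$+\; C^2 \left|\frac{p^2\, \beta'\Sigma\beta}{(p + 4 + \beta'X'X\beta)^2} - \frac{t^2\delta^2}{(t+\delta^2)^2}\right| + 4C\, \frac{t}{1-t}\, \frac{p}{(p + 2 + \beta'X'X\beta)^2}.$$

Lemma B.5 now entails that the expressions in the preceding display converge to zero in probability. (For the last term in the preceding display, we either have $p \to \infty$ if $t > 0$, or $t = 0$.)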

Our second goal is to show that $\sup_{0 \le c \le C} |r(\hat\delta^2, c, p/n) - r(\delta^2, c, t)|$ also converges to zero in probability. We can bound the supremum in question by

$$\left|\frac{p/n}{1 - p/n} - \frac{t}{1-t}\right| + 2C \left|\frac{(p/n)^2}{(1 - p/n)(p/n + \hat\delta^2)} - \frac{t^2}{(1-t)(t+\delta^2)}\right|$$

$$+\; C^2 \left|\frac{(p/n)^3}{(1 - p/n)(p/n + \hat\delta^2)^2} - \frac{t^3}{(1-t)(t+\delta^2)^2}\right| + C^2 \left|\frac{(p/n)^2\, \hat\delta^2}{(p/n + \hat\delta^2)^2} - \frac{t^2\delta^2}{(t+\delta^2)^2}\right|.$$

The expression in the preceding display is the sum of four terms, where each term is of the form $|f(p/n, \hat\delta^2) - f(t, \delta^2)|$ for some function $f(\cdot,\cdot)$ which is continuous on $[0,1) \times [0,\infty]$. Since $p/n \to t$ by assumption and $\hat\delta^2 \to \delta^2$ in probability (as is easy to see), the continuous mapping theorem entails that each of the four terms in the preceding display converges to zero in probability. □
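To see how close the finite-sample risk can be to its limit, one can evaluate the expression in (15) with the non-central chi-square moments computed through the Poisson mixture. The sketch below rests on several assumptions that should be read as such: the form of $r(\delta^2, c, t)$ is inferred from the limits appearing in this proof rather than quoted from the statement of Theorem 2, and the random quantities $\operatorname{trace}((V'V)^{-1})$, $\beta'\Sigma\beta$ and $\beta'X'X\beta$ are replaced by their limiting values $t/(1-t)$, $\delta^2$ and $n\delta^2$. All function names are hypothetical.

```python
import math


def inv_moment(k, lam, m):
    """E[(1/chi2_k(lam))^m] for k > 2m, via chi2_k(lam) ~ chi2_{k + 2J}(0)
    with J Poisson with mean lam / 2."""
    if k <= 2 * m:
        raise ValueError("need k > 2m for a finite moment")

    def central(kk):
        value = 1.0
        for i in range(1, m + 1):
            value /= (kk - 2 * i)
        return value

    mean = lam / 2.0
    if mean == 0.0:
        return central(k)
    width = 12.0 * math.sqrt(mean) + 40.0   # truncation window (assumption)
    lo = max(0, int(mean - width))
    hi = int(mean + width) + 1
    total = 0.0
    for j in range(lo, hi + 1):
        log_pmf = -mean + j * math.log(mean) - math.lgamma(j + 1)
        total += math.exp(log_pmf) * central(k + 2 * j)
    return total


def rho2_formula(c, p, trace_inv, b_sigma_b, ncp):
    """Expression (15) with trace((V'V)^{-1}) = trace_inv,
    beta' Sigma beta = b_sigma_b and beta' X'X beta = ncp."""
    e1 = p * inv_moment(p, ncp, 1)
    e2 = p ** 2 * inv_moment(p + 2, ncp, 2)
    e3 = p ** 2 * b_sigma_b * inv_moment(p + 4, ncp, 2)
    weight = c * c + 4.0 * c / p
    return trace_inv * (1.0 - 2.0 * c * e1 + weight * e2) + weight * e3


def r_limit(delta2, c, t):
    """Inferred limit (assumption): requires t + delta2 > 0 and t < 1."""
    s = t / (t + delta2)
    return t / (1.0 - t) * (1.0 - 2.0 * c * s + (c * s) ** 2) + (c * s) ** 2 * delta2


if __name__ == "__main__":
    p, n, c, delta2 = 200, 400, 1.0, 1.0
    t = p / n
    print(rho2_formula(c, p, t / (1.0 - t), delta2, n * delta2))
    print(r_limit(delta2, c, t))
```

With these idealized inputs the two printed values differ only by terms of order $1/p$, in line with the use of Lemma B.3 and Lemma B.5 in the proof.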

Proof of Theorem 3. On the almost-sure event where $X'X$ is invertible, $\rho_2(\hat\beta(c), \beta, X)$ is given by the formula (pi For invertible $X'X$ and for $\beta$ such that $\beta'(X'X/n)\beta = d^2$, the expression in (15) depends on $\beta$ only through $\beta'\Sigma\beta$, which is at most $d^2/\lambda_p(V'V/n)$, where $\lambda_p(\dots)$ denotes the smallest eigenvalue of the indicated matrix. [Indeed, we have $d^2 = \|(V'V/n)^{1/2}\Sigma^{1/2}\beta\|^2$ and $\beta'\Sigma\beta = \beta'\Sigma^{1/2}(V'V/n)^{1/2}\,[(V'V/n)^{-1}]\,(V'V/n)^{1/2}\Sigma^{1/2}\beta \le d^2/\lambda_p(V'V/n)$.] It follows that $\sup_{\beta \in \mathbb{R}^p} \rho_2(\hat\beta(c), \beta, X)$ can almost surely be written as the supremum of

$$\operatorname{trace}((V'V)^{-1}) \left[1 - 2c\, \mathbb{E}\left[\frac{p}{\chi^2_p(nd^2)}\right] + (c^2 + 4c/p)\, \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+2}(nd^2))^2}\right]\right] + \frac{c^2 + 4c/p}{\lambda_p(V'V/n)}\, \mathbb{E}\left[\frac{p^2 d^2}{(\chi^2_{p+4}(nd^2))^2}\right]$$

over $d^2 \ge 0$ or, equivalently, over $d \ge 0$. Write $R^*(d^2, c, n, p)$ for the random variable in the preceding display. The claim will follow if we can show that

$$\sup_{0 \le c \le C}\; \sup_{d \ge 0}\; \left|R^*(d^2, c, n, p) - R(d^2, c, t)\right|$$

converges to zero in probability. Now use the triangle inequality to bound the expression in the preceding 
display by the supremum of 



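$$\left|\operatorname{trace}((V'V)^{-1}) - \frac{t}{1-t}\right| + 2C \left|\operatorname{trace}((V'V)^{-1})\, \mathbb{E}\left[\frac{p}{\chi^2_p(nd^2)}\right] - \frac{t^2}{(1-t)(t+d^2)}\right|$$

$$+\; C^2 \left|\operatorname{trace}((V'V)^{-1})\, \mathbb{E}\left[\frac{p^2}{(\chi^2_{p+2}(nd^2))^2}\right] - \frac{t^3}{(1-t)(t+d^2)^2}\right| + C^2 \left|\frac{1}{\lambda_p(V'V/n)}\, \mathbb{E}\left[\frac{p^2 d^2}{(\chi^2_{p+4}(nd^2))^2}\right] - \frac{t^2 d^2}{(1-\sqrt{t})^2 (t+d^2)^2}\right|$$

$$+\; 4C \operatorname{trace}((V'V)^{-1})\, \mathbb{E}\left[\frac{p}{(\chi^2_{p+2}(nd^2))^2}\right] + 4C\, \frac{1}{\lambda_p(V'V/n)}\, \mathbb{E}\left[\frac{p\, d^2}{(\chi^2_{p+4}(nd^2))^2}\right]$$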



over $d \ge 0$. The first term of the preceding display converges to zero in probability and each of the remaining five terms is of the form $|U_n f_n(d) - u f(d)|$ multiplied by a constant, where $U_n$ is a random variable that converges to $u \in \mathbb{R}$ in probability (cf. Lemma B.2), and where $f_n(d)$ and $f(d)$ are functions of $d \in [0,\infty)$. For each of the remaining five terms, we now proceed as follows: To show that $\sup_{d \ge 0} |U_n f_n(d) - u f(d)| \to 0$ in probability, it suffices to show that $|U_n| \sup_{d \ge 0} |f_n(d) - f(d)| \to 0$ in probability (because $|U_n f_n(d) - u f(d)|$ is bounded by $|U_n||f_n(d) - f(d)| + |f(d)||U_n - u| \le |U_n||f_n(d) - f(d)| + |U_n - u|$, where the inequality follows upon noting that $f(d) \le 1$). Let $d_n$, $n \ge 1$, be a sequence of maximizers (or near maximizers) of $|f_n - f|$. Because $[0,\infty]$ is compact, we may assume that $d_n \to \delta \in [0,\infty]$ (replacing the original sequence by a convergent subsequence if necessary). Moreover, it is easy to see that $\lim_{d \to \infty} f(d) = 0$ (for each of the five remaining terms in the preceding display). In particular, we can extend $f(\cdot)$ to a continuous function on $[0,\infty]$ by setting $f(\infty) = 0$. To complete the proof, we now show that either (a) $f_n(d_n) - f(\delta) \to 0$ as $n \to \infty$, or (b) $u = 0$ and $|f_n(d_n)|$ is bounded. In the case where $t + \delta^2 > 0$, we see that either $p \to \infty$ or $n d_n^2 \to \infty$, and it is not difficult to conclude that case (a) occurs by using Lemma B.3 and the assumption that $p \ge 3$. And in the case where $t + \delta^2 = 0$, we see that case (b) occurs for the first, second, and fourth of the five remaining terms in the preceding display (by arguing as in the corresponding part in the proof of Theorem 2, mutatis mutandis), while again case (a) occurs for the third and the fifth as the expected values are bounded and $d_n$ tends to zero. □

Proof of Theorem 4. It suffices to show the statements for $\hat\beta(c_n)$. Set $R^* = \sup_{\delta^2 \ge 0} R(\delta^2, c, t)$ for $R(\cdot,\cdot,\cdot)$ as in Theorem 3 and note that $\sup_{\beta \in \mathbb{R}^p} \rho_2(\hat\beta(c_n), \beta, X) - \rho_2(\hat\beta_{ML}, \beta, X)$ converges to $R^* - t/(1-t)$ in probability by Theorems 2 and 3. For the first statement, assume either that $0 \le c \le 2$ and $t > [(c-2)/(c+2)]^2$ or that $c > 2$ and $t > 0$ hold (the case $c = 0$ can not occur there because $t < 1$). The statement now follows by observing, for such $c$ and $t$, that $R^* - t/(1-t) > 0$, i.e., that $R(\delta^2, c, t) - t/(1-t) > 0$ for some $\delta^2 \ge 0$ (which is elementary but tedious to verify), and by setting $\epsilon$ equal to, say, $(R^* - t/(1-t))/2$. For the second statement, assume that $0 \le c \le 2$ and $t \le [(c-2)/(c+2)]^2$, or that $c > 2$ and $t = 0$, and note, for such $t$ and $c$, that $R^* - t/(1-t) \le 0$, i.e., that $R(\delta^2, c, t) - t/(1-t) \le 0$ for each $\delta^2 \ge 0$.

For the last statement, let $c \in [0,2]$, and take $\epsilon > 0$. We need to show that $\mathbb{P}(\rho_2(\hat\beta(c_n), \beta, X) - \rho_2(\hat\beta_{ML}, \beta, X) > \epsilon)$ converges to zero for arbitrary sequences of parameters $\beta \in \mathbb{R}^p$ and $\Sigma$. Because the set $[0,\infty]$ is compact, we may assume that $\beta'\Sigma\beta$ converges to a limit $\delta^2 \in [0,\infty]$ (by considering convergent subsequences if necessary). It now follows from Theorem 2 that $\rho_2(\hat\beta(c_n), \beta, X)$ converges in probability to the limit $r(\delta^2, c, t)$ as is given in the theorem, while $\rho_2(\hat\beta_{ML}, \beta, X)$ converges in probability to $t/(1-t)$. The claim now follows from the fact that $r(\delta^2, c, t) \le t/(1-t)$, irrespective of $\delta^2 \in [0,\infty]$ (which, again, is easy if somewhat tedious to verify). □
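The "elementary but tedious" sign check in the proof of Theorem 4 can be explored numerically. The sketch below assumes a form of $R(\delta^2, c, t)$ that is inferred from the limits appearing in the proof of Theorem 3 (it is not quoted from the theorem itself), and it searches over a grid in $s = t/(t+\delta^2)$, which is an arbitrary discretization. Function names are hypothetical.

```python
import math


def big_r(delta2, c, t):
    """Inferred worst-case limit (assumption), for 0 < t < 1 and delta2 >= 0."""
    s = t / (t + delta2)
    ml_risk = t / (1.0 - t)
    return (ml_risk * (1.0 - 2.0 * c * s + (c * s) ** 2)
            + (c * s) ** 2 * delta2 / (1.0 - math.sqrt(t)) ** 2)


def worst_case_excess(c, t, grid=2000):
    """Maximum over a grid of delta2 >= 0 of R(delta2, c, t) - t/(1 - t)."""
    ml_risk = t / (1.0 - t)
    best = -math.inf
    for j in range(1, grid + 1):
        s = j / grid                     # s = t / (t + delta2) in (0, 1]
        delta2 = t * (1.0 - s) / s
        best = max(best, big_r(delta2, c, t) - ml_risk)
    return best


def threshold(c):
    """The critical ratio [(c - 2)/(c + 2)]^2 from the proof."""
    return ((c - 2.0) / (c + 2.0)) ** 2


if __name__ == "__main__":
    c = 1.0
    print(threshold(c))
    for t in (0.05, 0.10, 0.12, 0.2, 0.5):
        # The excess should be positive exactly for t above the threshold.
        print(t, worst_case_excess(c, t))
```

For $c = 1$ the threshold is $1/9$: on the grid, the excess is positive for $t$ above that value and non-positive below it, matching the case distinction in the proof.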